ABSTRACT

In this paper, we focus on two approaches to optimizing
producer/consumer synchronization in an auto-parallelizing
compiler. Emphasis is placed on constructing a criterion model
by which the compiler reduces the number of synchronization
operations needed to synchronize the dependences in a loop,
thereby reducing the overhead of enforcing all dependences.
Following this study, we transform and eliminate dependences on
the iteration space diagram (ISD), examine the problems of
acyclic and cyclic dependence in detail, eliminate some of the
dependences, and optimize the synchronization instructions. Some
didactic examples are included to illustrate the optimization
procedure.
1 Introduction

During the past decade, the field of compiling for parallel
architectures has expanded rapidly with the widespread
commercial availability of multicore processors [1][2]. Research
has focused on several goals, the major concern being support
for auto-parallelization. The goal of auto-parallelization is to
compile an unmodified, unannotated sequential program into a
parallel program [3].

Although in recent years most attention has been given to
languages with parallel annotations (e.g. OpenMP [4] allows the
programmer to manually hint parallel regions to the compiler),
the parallelization of legacy code remains significant. The
Parafrase system [5], developed at the University of Illinois,
was the first automatic parallelizing compiler based on
dependence analysis. The most ambitious goal of Parafrase was to
find out how to develop architectures that exploit the latent
parallelism in off-the-shelf "dusty deck" programs [6]. By using
producer/consumer synchronization (e.g. the synchronization
instructions implemented on the Alliant F/X8 [8][9]), the
ordering required by the dependences can be forced on the
program execution, allowing parallelism to be extracted from
loops with dependences.

In this paper, we focus on the parallelization of legacy code
and on optimizing producer/consumer synchronization via two
approaches in an auto-parallelizing compiler. We proceed as
follows. In Section 2, we present the compiler fundamentals and
the target architecture; we introduce the concepts of
auto-parallelizing compilers needed to follow the later sections
and, for clarity and brevity, direct the discussion towards a
single architecture. In Section 3, to understand how parallelism
can be extracted from cyclic loops using producer/consumer
synchronization, we discuss how to extract parallelism when the
dependence graph may be cyclic and loop freezing cannot be used
to break the cycles. In Section 4, we show how to reduce the
number of synchronization instructions used to synchronize a
loop.

Figure 1. High-level structure of a parallel compiler

2 The Compiler Fundamentals and Target 
Computer 

To relieve programmers of the tedious and error-prone manual
parallelization process, the compiler needs to automatically
convert sequential code into multi-threaded or vectorized code
that uses multiple processors simultaneously on a shared-memory
multiprocessor machine.

2.1 Automatic Parallelizing Compiler Fundamentals

The high-level flow of a compiler is shown in Figure 1. The
phases of the compiler are shown in the centre, and inputs and
intermediate files are shown as rounded boxes.

In fact, the source program may be a binary file, as used in
binary instrumentation tools and binary compilers [10][11]. In
general, a Java or Python source-to-bytecode compiler converts
the source file into a bytecode file that contains analysis
information for the compilation unit, which supports further
dependence analysis on that unit. A compilation unit is
lexically analyzed and parsed by the compiler. Lexical analysis
and parsing are not studied in this paper; a detailed discussion
of these compiler techniques (e.g. regular expressions,
deterministic finite automata, and non-deterministic finite
automata) can be found in [12]. The result of the parser is an
intermediate representation (IR), which can be regarded as an
abstract syntax tree, a graphical representation of the parsed
program. We modify this slightly and represent programs as a
control flow graph (CFG). In a control flow graph, each node
$b_i \in B$ is a basic block. In most presentations there are
two specially designated blocks: the entry block, through which
control enters the flow graph, and the exit block, through which
all control flow leaves. An edge $b_i \to b_j$ means that $b_i$
may execute directly before $b_j$. In addition, a CFG is
sometimes converted to static single assignment (SSA) form [13].

Dependence analysis determines whether or not it is safe to
reorder or parallelize statements. In general, a control
dependence ($S_1 \, \delta^c \, S_2$) is a situation in which a
program instruction executes only if the previous instruction
evaluates in a way that allows its execution. A data dependence
($S_1 \, \delta^f \, S_2$, $S_1 \, \delta^a \, S_2$,
$S_1 \, \delta^o \, S_2$, $S_1 \, \delta^i \, S_2$) arises from
two statements that access or modify the same resource [7]. Loop
dependence analysis is mostly done to enable
auto-parallelization. Its task is to determine whether
statements within a loop body form a dependence, with respect to
array access and modification, induction, reduction and private
variables, simplification of loop-independent code, and
management of conditional branches inside the loop body.

2.2 Shared-Memory Multiprocessor Machine

For clarity and brevity, the target computer assumed throughout
this paper is a shared-memory multiprocessor. In these systems,
the processing elements can access any of the global memory
modules through an interconnection network. Code executes
serially on each processor, and parallelism is realized by the
simultaneous execution of different iterations of a loop on
different processors. In the shared-memory version of a program,
each thread executes a subset of the iteration space of a
parallel loop, where the iteration space is the Cartesian space
bounded by the loop bounds. Figure 2 shows an example of the
scheduling and execution of a shared-memory program. However,
all large machines for high-performance numerical computing have
a physically distributed memory architecture. Distributed-memory
machines consist of nodes connected to one another by Ethernet
or a variety of proprietary interfaces.

Here, we have presented a short, informal discussion of compiler
fundamentals and shared-memory multiprocessor machines. The
interested reader will find a more complete discussion in
[12][14]. The details of producer/consumer synchronization
optimizations are discussed in the following sections.

Figure 2. Scheduling and execution of a shared memory program

3 Acyclic and Cyclic Dependence Analysis 

Most of the transformations in this paper are based on the
concept of dependence between statements. In a sequential
program, the statement instance $S_b^j$ is flow dependent on the
statement instance $S_a^i$ ($S_a^i \, \delta^f \, S_b^j$) if
$S_a^i$ assigns a value to a variable that may later be read by
$S_b^j$. $S_b^j$ is antidependent on $S_a^i$
($S_a^i \, \delta^a \, S_b^j$) if $S_a^i$ fetches from a
variable that may later be written by $S_b^j$. $S_b^j$ is output
dependent on $S_a^i$ ($S_a^i \, \delta^o \, S_b^j$) if $S_a^i$
modifies a variable that may later be modified by $S_b^j$.
$S_b^j$ is control dependent on $S_a^i$
($S_a^i \, \delta^c \, S_b^j$) if $S_a^i$ is a control construct
and whether $S_b^j$ executes depends on the outcome of $S_a^i$.
A more detailed discussion can be found in [15][16].
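The three data dependence kinds defined above can be illustrated with a minimal Python sketch. It assumes each statement instance is described only by the set of memory locations it reads and the set it writes; the function name `classify` and the string labels are hypothetical, and control dependence is not modelled.

```python
def classify(src_reads, src_writes, sink_reads, sink_writes):
    """Classify the data dependences from an earlier statement instance
    (the source) to a later one (the sink), given the sets of memory
    locations that each instance reads and writes."""
    kinds = set()
    if src_writes & sink_reads:
        kinds.add("flow")      # source writes, sink later reads
    if src_reads & sink_writes:
        kinds.add("anti")      # source reads, sink later writes
    if src_writes & sink_writes:
        kinds.add("output")    # source writes, sink later writes
    return kinds


# The source instance writes a[1]; the sink instance later reads a[1].
print(classify({"b[0]"}, {"a[1]"}, {"a[1]"}, {"c[2]"}))
```

Two instances that only read a common location yield an empty set here, since this sketch does not track input dependences.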

To parallelize loops with acyclic and cyclic dependence graphs,
Samuel P. Midkiff summarized the following steps [17]:

1. Construct a dependence graph for the loop nest.
2. Find the strongly connected components (SCCs) formed by
   cycles of dependences in the graph, and contract the nodes of
   each SCC into a single large node. (A directed graph is
   strongly connected if there is a path from each vertex in the
   graph to every other vertex.)
3. Mark all nodes in the graph containing a single statement as
   parallel.
4. Topologically sort the graph so that all inter-node
   dependences are lexically forward.
5. Group independent, unordered nodes that read the same data
   and are marked as parallel into new nodes to optimize data
   reuse.
6. Carry out loop fission to form a new loop for each node.
7. Mark as parallel all loops resulting from nodes whose
   statements are marked as parallel in the sorted graph.

These steps will be explained in detail by means of an 
example in the remainder of this section. 
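The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation of the cited work: the dependence graph is given as a list of (source, sink) statement pairs, SCCs are found with Kosaraju's algorithm (which happens to list them in a topological order of the contracted graph), a single statement that depends on itself is treated as sequential, and the grouping step for data reuse is omitted. The names `contract_sccs` and `plan_loops` are hypothetical.

```python
from collections import defaultdict


def contract_sccs(nodes, edges):
    """Kosaraju's algorithm. Returns the SCCs of the dependence graph,
    listed in a topological order of the contracted graph."""
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)

    finish_order, seen = [], set()

    def visit(u):                      # first pass: DFS on the graph
        seen.add(u)
        for v in succ[u]:
            if v not in seen:
                visit(v)
        finish_order.append(u)

    for u in nodes:
        if u not in seen:
            visit(u)

    sccs, assigned = [], set()

    def collect(u, scc):               # second pass: DFS on reversed edges
        assigned.add(u)
        scc.add(u)
        for v in pred[u]:
            if v not in assigned:
                collect(v, scc)

    for u in reversed(finish_order):
        if u not in assigned:
            scc = set()
            collect(u, scc)
            sccs.append(scc)
    return sccs


def plan_loops(nodes, edges):
    """One (statements, is_parallel) entry per loop produced by fission."""
    self_dependent = {u for u, v in edges if u == v}
    return [
        (sorted(scc), len(scc) == 1 and not (scc & self_dependent))
        for scc in contract_sccs(nodes, edges)
    ]


# Acyclic graph: every statement becomes its own parallel loop.
acyclic = [("S2", "S1"), ("S2", "S3"), ("S2", "S4"), ("S1", "S3"), ("S4", "S3")]
print(plan_loops(["S1", "S2", "S3", "S4"], acyclic))

# Cyclic graph: all three statements fall into one sequential loop.
cyclic = [("S2", "S1"), ("S3", "S2"), ("S2", "S3"), ("S1", "S3")]
print(plan_loops(["S1", "S2", "S3"], cyclic))
```

The order in which independent nodes are listed depends on the traversal order, so it is one valid topological order among several.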



3.1 Parallelizing Loops with Acyclic Dependence Graphs



Consider the program in Algorithm 1. Its acyclic dependence
graph is illustrated in Fig. 3(a). $\Delta$ denotes the
dependence distance (e.g. given a dependence
$S_a^i \, \delta \, S_b^j$ between instances, $\Delta = j - i$).
The node at the tail of a dependence arc is the dependence
source ($S_a$), and the node at the head of the arc is the
dependence sink ($S_b$). The dependence graph is topologically
sorted so that all dependences become lexically forward
($\Delta \ge 0$; i.e. in branchless code the sink of the
dependence is lexically forward of the source of the
dependence). The canonical application of topological sorting is
in scheduling a sequence of jobs or tasks based on their
dependencies. A topological ordering is possible if and only if
the graph has no directed cycles, that is, if it is a directed
acyclic graph (DAG). Any DAG has at least one topological
ordering, and algorithms are known for constructing a
topological ordering of any DAG in linear time. A more detailed
algorithm can be found in [18].



Algorithm 1 A program with dependence.
for i = 1; i < n; i++ do
    S1: a[i] <- b[i - 1] + ...;
    S2: b[i] <- c[i - 1] + ...;
    S3: ... <- a[i - 1] + b[i] * d[i - 2];
    S4: d[i] <- b[i - 2] - ...;
end for



Moreover, since code executes serially on a given processor, and
therefore within an iteration of a loop, only dependences with a
distance greater than zero ($\Delta > 0$) need to be
synchronized explicitly.
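As a rough illustration, the flow dependences of Algorithm 1 and their distances can be derived from the array subscripts. The Python sketch below assumes every subscript has the form i + k and is recorded as the offset k; the elided "..." terms are ignored, and the left-hand side of S3, which the listing elides, is recorded as None. The name `flow_dependences` and the tuple layout are hypothetical.

```python
def flow_dependences(statements):
    """statements: (label, write, reads) triples in lexical order, where
    write is (array, offset) or None and reads is a list of (array, offset);
    an offset k stands for the subscript i + k.
    Returns (source, sink, array, distance) for each flow dependence."""
    deps = []
    for src_pos, (src, write, _) in enumerate(statements):
        if write is None:
            continue
        w_arr, w_off = write
        for snk_pos, (snk, _, reads) in enumerate(statements):
            for r_arr, r_off in reads:
                if r_arr != w_arr:
                    continue
                # The write in iteration i touches element i + w_off and the
                # read in iteration j touches j + r_off, so j - i = w_off - r_off.
                dist = w_off - r_off
                # Distance 0 is a dependence only if the write comes first.
                if dist > 0 or (dist == 0 and src_pos < snk_pos):
                    deps.append((src, snk, w_arr, dist))
    return deps


algorithm_1 = [
    ("S1", ("a", 0), [("b", -1)]),
    ("S2", ("b", 0), [("c", -1)]),
    ("S3", None, [("a", -1), ("b", 0), ("d", -2)]),  # left-hand side elided
    ("S4", ("d", 0), [("b", -2)]),
]
deps = flow_dependences(algorithm_1)
needs_sync = [d for d in deps if d[3] > 0]   # loop-carried dependences only
print(deps)
```

Only the dependence from S2 to S3 on b has distance zero; it is enforced by the serial execution of a single iteration and so is dropped from `needs_sync`.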

After the topological sort, the dependence graph of Fig. 3(a) is
transformed into that of Fig. 3(b). Several orderings of the
nodes can result from a topological sort; this is one valid
order. The loop can then be fully parallelized by breaking it
into multiple loops, none of which contains both the source and
the sink of a loop-carried (cross-iteration) dependence. The
loop is transformed by reordering the statements to match the
topological sort order, as shown in Algorithm 2.

Starting from Algorithm 2, a more efficient parallelization can
be obtained by a different partitioning of statements among
loops that is still consistent with the ordering implied by the
topological sort. The more efficient partitioning keeps
statements that are not related by a loop-carried dependence
together in the same loop. Splitting a loop in this way is
called loop fission (also called loop distribution in the
literature [19]). Acyclic portions of the dependence graph may
be sorted so that dependences are lexically forward, with a
legal fission then being possible. In Algorithm 2, S1 and S4 can
remain in the same loop because no loop-carried dependence
relates them. The resulting program, whose statement ordering
yields slightly better locality, is shown in Algorithm 3.
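The legality of this fission can be checked empirically with a small Python sketch. It compares the sequential loop of Algorithm 1 with the three fissioned loops (S2; then S1 and S4; then S3), running each fissioned loop in a shuffled iteration order to mimic an arbitrary parallel schedule. The elided "..." terms are replaced by the constant 1, the elided left-hand side of S3 is a hypothetical array e, and the loops start at i = 2 so that every subscript stays in range; all of these are assumptions made only for the example.

```python
import random


def sequential(c, n):
    """The original loop, with the elided terms filled in (assumed)."""
    a, b, d, e = ([0] * n for _ in range(4))
    for i in range(2, n):
        a[i] = b[i - 1] + 1                  # S1
        b[i] = c[i - 1] + 1                  # S2
        e[i] = a[i - 1] + b[i] * d[i - 2]    # S3
        d[i] = b[i - 2] - 1                  # S4
    return a, b, d, e


def fissioned(c, n, seed=0):
    """The fissioned loops; iterations of each loop run in random order."""
    rng = random.Random(seed)
    a, b, d, e = ([0] * n for _ in range(4))

    def any_order():
        idx = list(range(2, n))
        rng.shuffle(idx)
        return idx

    for i in any_order():                    # parallel loop: S2
        b[i] = c[i - 1] + 1
    for i in any_order():                    # parallel loop: S1 and S4
        a[i] = b[i - 1] + 1
        d[i] = b[i - 2] - 1
    for i in any_order():                    # parallel loop: S3
        e[i] = a[i - 1] + b[i] * d[i - 2]
    return a, b, d, e


c = list(range(10))
print(sequential(c, 10) == fissioned(c, 10))
```

Agreement on one input is of course not a proof; the dependence analysis is what justifies the transformation.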




Figure 3. (a) The dependence graph for the program. (b) The
dependence graph after it has been topologically sorted.



Algorithm 2 The program is transformed to reflect the order of
the topologically sorted dependence graph.
for parallel i = 1; i < n; i++ do
    S2: b[i] <- c[i - 1] + ...;
end for
for parallel i = 1; i < n; i++ do
    S1: a[i] <- b[i - 1] + ...;
end for
for parallel i = 1; i < n; i++ do
    S4: d[i] <- b[i - 2] - ...;
end for
for parallel i = 1; i < n; i++ do
    S3: ... <- a[i - 1] + b[i] * d[i - 2];
end for





Algorithm 3 The program is transformed to reflect the order of
the topologically sorted dependence graph and loop fission.
(invariant)...
for parallel i = 1; i < n; i++ do
    S1: a[i] <- b[i - 1] + ...;
    S4: d[i] <- b[i - 2] - ...;
end for
(invariant)...




Figure 4. (a) A dependence graph with SCC contracted into 
nodes, (b) A pipelined execution of the SCC across three 
threads. 



3.2 Parallelizing Loops with Cyclic Dependence Graphs

A cyclic dependence graph contains at least one loop-carried
dependence, and the statements on a cycle form an SCC in the
dependence graph. The most straightforward way to deal with the
statements in each SCC is to place them in a loop that is
executed sequentially. Another way of extracting parallelism
from these loops is to execute the SCC in a pipelined fashion.
An example of this is shown in Fig. 4. This is called decoupled
software pipelining, and is described in detail in [20].

In Section 4, we show how parallelism can sometimes be extracted
from these loops using producer/consumer synchronization, and
how that synchronization can be optimized.

4 Optimizing Synchronization Algorithm 

There is no guarantee that the order in which a parallel program
executes on different threads will enforce the dependences.
However, by using producer/consumer synchronization, this
ordering can be forced on the program execution, allowing
parallelism to be extracted from loops with dependences.

In the 1980s and early 1990s, several forms of producer/consumer
synchronization were implemented (e.g. full/empty
synchronization, implemented in the Denelcor HEP [21]). The
Alliant F/X8 implemented the advance(r, i) and await(r, i)
synchronization instructions. In 1987, Samuel P. Midkiff
discussed compiler algorithms for synchronization [22]. He
explained the with, quit, test, testset, wait, and set
instructions in detail.

Figure 5. The iteration space of the loop of Algorithm 4

In this section, compiler exploitation of both of these kinds of
synchronization instructions, and of general producer/consumer
synchronization, is discussed in terms of send and wait
synchronization. wait($reg_\delta$, i, vars) waits until the
value of $reg_\delta$ is i. send($reg_\delta$, i, vars) writes
the value i to $reg_\delta$, where i is the loop index variable,
$reg_\delta$ is the synchronization register used for dependence
$\delta$, and vars contains the variables whose dependence is
being synchronized. The send and wait instructions also have
functionality equivalent to a fence instruction: the results of
all memory accesses before the send or wait are visible before
the send or wait completes, and the hardware does not move
instructions past the synchronization operation at run time.
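A software analogue of send and wait can be sketched with Python threads. This is an illustrative model under stated assumptions rather than the hardware instructions: the class `SyncRegister` is hypothetical, the vars argument is omitted, the register records every posted iteration (instead of holding a single integer) so that sends may arrive in any order, and a wait on an iteration before the first loop iteration returns immediately. The fence behaviour described above is approximated by the lock inside `threading.Condition`.

```python
import threading


class SyncRegister:
    """One synchronization register for one dependence (hypothetical model)."""

    def __init__(self, first_iteration):
        self._first = first_iteration
        self._posted = set()
        self._cond = threading.Condition()

    def send(self, i):
        """Announce that the dependence source in iteration i has executed."""
        with self._cond:
            self._posted.add(i)
            self._cond.notify_all()

    def wait(self, i):
        """Block until the dependence source in iteration i has executed."""
        if i < self._first:          # source lies outside the iteration space
            return
        with self._cond:
            self._cond.wait_for(lambda: i in self._posted)


def run_chain(n):
    """Run a[i] = a[i - 1] + 1 with one thread per iteration."""
    a = [0] * n
    reg = SyncRegister(first_iteration=1)

    def iteration(i):
        reg.wait(i - 1)              # sink of the distance-1 dependence
        a[i] = a[i - 1] + 1
        reg.send(i)                  # source of the distance-1 dependence

    # Start the threads in reverse order to show that the ordering comes
    # from the synchronization, not from the launch order.
    threads = [threading.Thread(target=iteration, args=(i,))
               for i in reversed(range(1, n))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a


print(run_chain(8))
```

A loop whose only statement depends on its own previous iteration gains nothing from this, since the synchronization serializes it; the example is meant only to show the semantics of the two operations.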

4.1 Inserting Synchronization Instructions

Given the dependence graph, a compiler can synchronize a program
directly. As an example of producer/consumer synchronization,
consider the simplified program in Algorithm 4. Its dependences
are easy to identify: $\delta^f$ with $\Delta_a = 1$; $\delta^f$
with $\Delta_b = 2$; and $\delta^f$ with $\Delta_c = 1$.



Algorithm 4 A loop with cross-iteration dependence.
for i = 1; i < n; i++ do
    S1: a[i] <- b[i - 1] + ...;
    S2: b[i] <- c[i - 1] + ...;
    S3: c[i] <- b[i - 2] + a[i - 1];
end for



Once the dependence distances are known from the dependence
graph, the iteration space of the loop can be drawn as in
Figure 5.

The iteration space diagram determines where the synchronization
instructions are placed. The green dotted line denotes
$\delta^f$ with $\Delta_a = 1$, the brown dotted line denotes
$\delta^f$ with $\Delta_b = 2$, and the solid line denotes
$\delta^f$ with $\Delta_c = 1$. After the source of each
dependence $\delta$, the compiler inserts the instruction
send($reg_\delta$, i, vars). Before each dependence sink, the
compiler inserts the instruction wait($reg_\delta$, i - $d_i$,
vars), where $d_i$ is the distance of the dependence on the i
loop. The loop synchronized with send/wait synchronization is
shown in Algorithm 5.

Algorithm 5 A loop of the program synchronized with send/wait
synchronization.
for i = 1; i < n; i++ do
    S1: a[i] <- b[i - 1] + ...;
    send(0, i, a);
    wait(2, i - 1, c);
    S2: b[i] <- c[i - 1] + ...;
    send(1, i, b);
    wait(1, i - 2, b);
    wait(0, i - 1, a);
    S3: c[i] <- b[i - 2] + a[i - 1];
    send(2, i, c);
end for
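The insertion rule can be written down as a short Python sketch. It assumes the dependences are supplied as (source, sink, distance, variable) tuples, that register k serves the k-th dependence in the list, and that only the three dependences listed in the text for Algorithm 4 are given; the function name `insert_sync` and the textual output format are hypothetical.

```python
def insert_sync(statements, deps):
    """statements: statement labels in lexical order.
    deps: (source, sink, distance, var) tuples; register k serves deps[k].
    Returns the loop body as a list of labels and synchronization calls."""
    body = []
    for stmt in statements:
        # A wait goes before every sink of a loop-carried dependence.
        for reg, (src, snk, dist, var) in enumerate(deps):
            if snk == stmt and dist > 0:
                body.append(f"wait({reg}, i-{dist}, {var})")
        body.append(stmt)
        # A send goes after every source of a loop-carried dependence.
        for reg, (src, snk, dist, var) in enumerate(deps):
            if src == stmt and dist > 0:
                body.append(f"send({reg}, i, {var})")
    return body


algorithm_4_deps = [
    ("S1", "S3", 1, "a"),   # register 0
    ("S2", "S3", 2, "b"),   # register 1
    ("S3", "S2", 1, "c"),   # register 2
]
for line in insert_sync(["S1", "S2", "S3"], algorithm_4_deps):
    print(line)
```

For this input the sketch emits the same instructions as Algorithm 5; the two waits before S3 appear in register order, which does not affect the orderings they enforce.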



The fact that producer/consumer synchronization instructions are
no longer supported in hardware shows the impact that technology
and economics have on what is a desirable architectural feature
[23]. Specialized synchronization instructions fell out of favor
because of the increased latencies required when synchronizing
across the system bus between general-purpose processors, and
because the RISC principles of instruction set design [24]
favored simpler instructions from which send and wait
instructions could be built, albeit at a higher run-time cost.
Except for questions of profitability, the compiler strategy for
inserting and optimizing synchronization is indifferent to
whether it is implemented in software or hardware. These
optimizations are explained in detail in the remainder of this
section.

4.2 Two Approaches to Optimize Synchronization 

Sometimes a compiler can reduce the number of synchronization
operations needed to synchronize the dependences in a loop. All
dependences must still be enforced, so this optimization reduces
the overhead of enforcing them by allowing a single send/wait
pair to synchronize more than one dependence, or a combination
of send/wait instructions to synchronize additional dependences.
Algorithm 6 shows a loop with dependences to be synchronized.

Algorithm 6 A loop with dependence to be synchronized.
for i = 1; i < n; i++ do
    S1: a[i] <- ...;
    S2: b[i] <- c[i - 1] + ...;
    S3: c[i] <- a[i - 2];
end for



The loop has two dependences: $\delta^f$ with $\Delta_a = 2$ and
$\delta^f$ with $\Delta_c = 1$. The iteration space diagram
(ISD) of Figure 6 shows the dependences to be enforced as the
blue solid lines and the green dashed lines, and the execution
order implied by the sequential execution of the program as the
brown dashed lines. The region outlined with the dotted box is
representative of a section of the ISD that is examined by the
algorithm of [10], which eliminates dependences using transitive
reduction.

Figure 6. The ISD for the loop of Algorithm 6

Let $S_j(k)$ represent the instance of statement $S_j$ in
iteration i = k. Consider the dependence with distance two from
statement $S_1$ in iteration i = 2 to statement $S_3$ in
iteration i = 4. There is a path
$S_1(2) \to S_2(2) \to S_3(2) \to S_2(3) \to S_3(3) \to S_2(4) \to S_3(4)$
from $S_1$ in iteration 2 to $S_3$ in iteration 4, shown by the
black lines in the dotted box. If the dependence from $S_3$ to
$S_2$ has been synchronized, then the existence of this path of
enforced orders implies that the dependence from $S_1(2)$ to
$S_3(4)$ is also enforced. Because the distances are constant,
the iteration space can be covered by shifting the boxed region,
so every instance of the dependence within the iteration space
is synchronized. Samuel P. Midkiff has shown how to perform such
a transitive reduction on the ISD [10]. It is possible for
multiple dependences to work together to eliminate another
dependence. The transitive reduction is performed on the ISD,
which needs to contain only a subset of the total iteration
space (i.e. the case shown by the dotted box in Figure 6). For
each loop in the loop nest over which the synchronization
elimination takes place, the number of iterations needed in the
ISD for the loop is equal to the least product of the unique
prime factors of the dependence distance, plus one.
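The path argument for Algorithm 6 can be checked with a small reachability test over a three-iteration window of the ISD. This Python sketch is not the algorithm of [10]; it only assumes that enforced orders are the sequential order of statements within one iteration plus every instance of an already-synchronized dependence, and the names `build_isd` and `reachable` are hypothetical.

```python
from collections import deque


def build_isd(iterations, statements, synced):
    """Edges of enforced order: sequential order inside each iteration,
    plus every instance of the already-synchronized dependences, given
    as (source_stmt, sink_stmt, distance) tuples."""
    edges = {}
    for k in iterations:
        for s, t in zip(statements, statements[1:]):
            edges.setdefault((s, k), []).append((t, k))
        for src, snk, dist in synced:
            if k + dist in iterations:
                edges.setdefault((src, k), []).append((snk, k + dist))
    return edges


def reachable(edges, start, goal):
    """Breadth-first search for a path of enforced orders."""
    queue, seen = deque([start]), {start}
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


window = range(2, 5)                     # iterations 2, 3 and 4
stmts = ["S1", "S2", "S3"]
isd = build_isd(window, stmts, [("S3", "S2", 1)])
# The distance-2 dependence on a is redundant if S1(2) reaches S3(4).
print(reachable(isd, ("S1", 2), ("S3", 4)))   # True
```

Without the synchronized dependence from S3 to S2, no path leaves iteration 2, so the same query returns False and the dependence on a must be synchronized explicitly.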

Another synchronization elimination approach [25] is based on
pattern matching and works even if the dependence distances are
not constant. The matched patterns identify dependences whose
lexical relationship and distances are such that synchronizing
one dependence also synchronizes the other by forming a path
like the one shown in Figure 6 (i.e. the black lines in the
dotted box). In the program of Algorithm 6, let the forward
dependence with a distance of two that is to be eliminated be
$\delta_e$, and let the backward dependence of distance one that
is used to eliminate the other dependence be $\delta_r$. One
such pattern is as follows:

i There is a path from the source of $\delta_e$ to the source of
some $\delta_r$.

ii The sink of $\delta_r$ reaches the sink of $\delta_e$.

iii $\delta_r$ is lexically backward (i.e. the sink precedes the
source in the program flow).

iv The absolute value of the distance of $\delta_r$ is one.

v The signs of the distances of $\delta_e$ and $\delta_r$ are
the same.

If all five conditions hold, $\delta_e$ can be eliminated.

Conditions i and ii establish the proper flow between $\delta_e$
and $\delta_r$. Condition iii recognizes that $\delta_r$ can be
traversed repeatedly to reach all iterations that are a multiple
of its distance away from the source. Conditions iv and v show
that, because the absolute value of the distance is one and the
signs of the two distances are equal, the traversal enabled by
condition iii will reach the source of $\delta_e$.

5 Conclusion 

We have studied the use of the send and wait instructions to
synchronize loops. We have given general strategies for treating
branches within a loop being synchronized, and presented two
approaches to reduce the number of producer/consumer
synchronization instructions on a shared-memory multiprocessor
machine.

In general, synchronizing a parallel version of a program
involves four steps. First, a dependence graph is constructed
for the program. Second, a synchronization method is chosen for
the loop, depending on the structure of the dependence graph and
the relative costs of the different synchronization methods on
the target machine. Third, synchronization instructions are
inserted so that the cross-iteration dependences are enforced.
Finally, redundant dependences are eliminated and the
synchronization instructions are optimized.

An auto-parallelizing compiler can perform all of these steps
automatically, which relieves programmers of the tedious and
error-prone manual parallelization process.